This repository was archived by the owner on Dec 20, 2025. It is now read-only.

fix(google): Add retry and status polling logic #6379

Merged

jasonmcintosh merged 3 commits into spinnaker:release-1.36.x from armory-io:2025-07-29 on Aug 6, 2025

Conversation

@mrusan (Contributor) commented Jul 31, 2025

The fix addresses an issue we've encountered where, during red-black deployments, subsequent steps could execute before autoscaler creation finished, causing inconsistent behavior and potential failures on delivery.

Now the async operations use the GoogleOperationPoller, which implements proper retry logic and handles operation status polling.

@mrusan mrusan requested review from plumpy and skandragon as code owners July 31, 2025 17:38
// Without waiting for autoscaler creation to complete, subsequent deployment steps (health checks, traffic routing)
// may execute before the autoscaler is active, leading to inconsistent behavior and potential deployment failures.
//
// This fix aligns Spinnaker GCP behavior with Spinnaker AWS behavior, where autoscaling group operations are synchronous.
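The code comment above describes waiting for the autoscaler insert to finish before later deployment steps run. A minimal sketch of that poll-until-done pattern, assuming hypothetical names (`fetchOperationStatus`, `waitForOperation`) that stand in for clouddriver's actual GoogleOperationPoller API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: the real implementation delegates to GoogleOperationPoller.
public class OperationPollSketch {
    static final String OPERATION_DONE = "DONE";
    static final AtomicInteger calls = new AtomicInteger();

    // Simulated status endpoint: reports PENDING twice, then DONE.
    static String fetchOperationStatus(String operationName) {
        return calls.incrementAndGet() < 3 ? "PENDING" : OPERATION_DONE;
    }

    // Poll with a deadline instead of assuming the async insert finished.
    static String waitForOperation(String operationName, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long backoffMillis = 10;
        while (System.currentTimeMillis() < deadline) {
            String status = fetchOperationStatus(operationName);
            if (OPERATION_DONE.equals(status)) {
                return status;
            }
            Thread.sleep(backoffMillis);
            backoffMillis = Math.min(backoffMillis * 2, 1000); // capped exponential backoff
        }
        throw new IllegalStateException("Timed out waiting for " + operationName);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitForOperation("insert-autoscaler", 5_000));
    }
}
```

The key behavioral change is the deadline-bounded loop: a PENDING response is no longer treated as success.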
Contributor:

This feels like a big enough change that it warrants a feature flag.

Member:

So this is a bug, not a feature. The current implementation gets an async response; instead of polling for the state of that response, it just accepts it if it doesn't error and continues. Turns out there are cases where that response is in a PENDING sort of state that causes failures unless you poll for completion.

Member:

The comment isn't exactly correct on the internals of where this fails...

Contributor:

Right, but... how many folks are depending on the current behavior? Or, what if this actually makes things worse (at least temporarily, while we work out the kinks)?

Member:

So this can literally cause failures without the polling. See spinnaker/spinnaker#7170 for this.

Contributor:

Yeah, it's a tough call. I bet there are some setups that are going to start to fail when this behavior changes -- pipelines are going to take longer to complete than people expect, even if what people expect is "wrong." In the spirit of reducing blast radius I think a feature flag is a good idea, but I won't hold up the PR for it.

Member:

Let's look at a feature flag. My only concern is that doing this as a flag may be more complicated/risky than doing it without one. That said, I understand the concerns!

Member:

A potential middle ground: a feature flag to opt out rather than opt in, for people running into issues? With a request to post in Slack if they need to turn it on. If nobody messages after one or two releases, rip off the bandaid?

Contributor:

Yeah, I'm good with having the feature flag enabled by default.

Contributor (Author):

@mattgogerly @dbyron-sf @jasonmcintosh thank you for reviewing the code and for your input. I've added an enableAsyncOperationWait flag, enabled by default, along with a log message. I haven't found other places where we log such messages, so let me know if there's a better example.
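The flag wiring might look roughly like this. `enableAsyncOperationWait` is the real flag name from the PR, but the surrounding class, method, and log text are illustrative assumptions, not clouddriver's actual code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AsyncWaitFlagSketch {
    // Enabled by default, as merged in the PR; operators can opt out.
    static volatile boolean enableAsyncOperationWait = true;
    private static final AtomicBoolean warned = new AtomicBoolean(false);

    static String deployAutoscaler() {
        if (!enableAsyncOperationWait) {
            // Warn only once so the legacy opt-out doesn't flood the logs.
            if (warned.compareAndSet(false, true)) {
                System.err.println("WARN: enableAsyncOperationWait=false; autoscaler "
                        + "operations will not be polled to completion (legacy behavior).");
            }
            return "fire-and-forget";
        }
        return "polled-to-completion";
    }

    public static void main(String[] args) {
        System.out.println(deployAutoscaler());
    }
}
```

The `compareAndSet` guard matches the "log the warning just once" follow-up commit in this PR's history.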

registry
)

if (operation) {
Contributor:

Under what circumstances is operation null? Is it worth logging a message or emitting a metric?

Member:

The previous invocation would always return null. Since polling for the state of the operation (if it's there) requires passing operation information, we have to check whether that's null before using the operation. This null check wasn't needed previously because we weren't polling for state.

and below, where null was always returned.

Contributor:

I see... Wouldn't the world be a better place if SafeRetry.doRetry let the exception bubble up? It seems like we're effectively swallowing them, which doesn't seem great.

Member:

Yes, I agree ;) I think that's another area that could use some improvement... probably separate from this.
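The bubble-up behavior the reviewers are describing can be sketched as a retry helper that rethrows the last failure after exhausting its attempts instead of returning null. This is illustrative only; SafeRetry's actual signature and semantics in clouddriver differ:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Retry the action; if every attempt fails, rethrow the last exception
    // rather than swallowing it and returning null.
    static <T> T doRetry(Callable<T> action, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure; retry unless attempts are exhausted
            }
        }
        throw last; // bubble up so the caller sees why the operation failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = doRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient");
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The difference from a swallow-and-return-null design is that callers can distinguish "operation genuinely absent" from "operation lookup kept failing."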

@jasonmcintosh jasonmcintosh changed the title chore(google): Add retry and status polling logic for autoscalers and… fix(google): Add retry and status polling logic Aug 4, 2025
@jasonmcintosh (Member):

FYI, I think I'm good with this... so if others are good, I think we can get this merged. We're fixing GCP builds in master now, which are currently a bit broken :(

@dbyron-sf dbyron-sf closed this Aug 6, 2025
@dbyron-sf dbyron-sf reopened this Aug 6, 2025
@jasonmcintosh jasonmcintosh merged commit 21102ec into spinnaker:release-1.36.x Aug 6, 2025
21 checks passed
christosarvanitis added a commit to armory-io/clouddriver that referenced this pull request Sep 26, 2025
* fix(google): Add retry and status polling logic (spinnaker#6379)

* chore(google): Add retry and status polling logic for autoscalers and backend services

* chore(google): Added enableAsyncOperationWait=true flag for legacy behaviour

* chore(google): Log enableAsyncOperationWait flag warning just once

* fix(builds): Fixes upload URLs for releases

* chore(accounts-api): Cherry-picks from monorepo to release-1.36.x

---------

Co-authored-by: Maksim Rusan <mrusan@users.noreply.github.com>
Co-authored-by: Jason McIntosh <jason.mcintosh@harness.io>
